Deep learning theory

Numerical computation

Directional derivative

  • The directional derivative in direction $u$ (a unit vector) is the slope of the function f in direction $u$. Formally, it is the derivative of the function $f(x + \alpha u)$ with respect to $\alpha$, evaluated at $\alpha = 0$. You can verify this by writing out the derivative of this function of $\alpha$.
  • Using the chain rule, we can see that $\frac{\partial}{\partial\alpha}f(x + \alpha u)\Big|_{\alpha=0} = u^{T}\nabla_{x}f(x)$.
  • To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

    $\min_{u,\,u^{T}u=1} u^{T}\nabla_{x}f(x) = \min_{u,\,u^{T}u=1} \|u\|_{2}\,\|\nabla_{x}f(x)\|_{2}\cos\theta$

    Where $\theta$ is the angle between $u$ and the gradient. Substituting in $\|u\|_{2} = 1$ and ignoring factors that do not depend on $u$, this simplifies to $\min_{u}\cos\theta$. This is minimized when $u$ points in the opposite direction from the gradient.

  • This is known as the method of steepest descent, or gradient descent.

  • Steepest descent proposes a new point $x' = x - \epsilon\,\nabla_{x}f(x)$, where $\epsilon$ is the learning rate (step size).

Ways to choose the step size

  • Set $\epsilon$ to a small constant.
  • Evaluate $f(x - \epsilon\,\nabla_{x}f(x))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value.
  • Line search.
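
A minimal sketch comparing the first two strategies (the toy quadratic objective, its gradient, and the candidate step sizes are all assumptions for illustration):

```python
import numpy as np

def f(x):
    # toy ill-conditioned quadratic objective (an assumption for this example)
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_f(x):
    return np.array([x[0], 10.0 * x[1]])

x = np.array([1.0, 1.0])
for _ in range(20):
    g = grad_f(x)
    # strategy 1: fixed small constant step size
    # x = x - 0.01 * g
    # strategy 2: evaluate several candidate step sizes, keep the best
    candidates = [1e-3, 1e-2, 1e-1, 1.0]
    x = min((x - eps * g for eps in candidates), key=f)
print(x, f(x))
```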

Beyond the gradient: Jacobian and Hessian Matrices

  • Second derivative measures curvature.

  • If the gradient is 1 and we make a step of size $\epsilon$ along the negative gradient, the cost function will decrease by more than $\epsilon$ if the second derivative is negative, meaning that the function curves downward.

  • The Hessian matrix H(f)(x) is defined such that:

    $H(f)(x)_{i,j} = \frac{\partial^{2}}{\partial x_{i}\partial x_{j}}f(x)$

    Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.

  • The second derivative in a specific direction represented by a unit vector d is given by $d^{T}Hd$. This can be easily verified by taking the second derivative of $f(x + \alpha d)$ with respect to $\alpha$ and evaluating at $\alpha = 0$.

  • When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue. This can be verified from the definition of an eigenvalue, $Hd = \lambda d$.

  • For other directions of $d$, the directional second derivative is a weighted average of all the eigenvalues, with weights between 0 and 1; eigenvectors that have a smaller angle with d receive larger weight. This can be verified by applying the eigendecomposition of H to the unit vector d.

  • The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative.

  • To summarize, the second derivative tells us how well we can expect a gradient descent step to perform.

  • Second-order Taylor series approximation to the function f(x) around the current point $x^{(0)}$:

    $f(x) \approx f(x^{(0)}) + (x - x^{(0)})^{T}g + \frac{1}{2}(x - x^{(0)})^{T}H(x - x^{(0)})$

    Where g is the gradient and H is the Hessian.

  • If we use a learning rate of $\epsilon$, then the new point x will be given by $x^{(0)} - \epsilon g$. Substituting this into our approximation, we obtain

    $f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^{T}g + \frac{1}{2}\epsilon^{2}g^{T}Hg$

    There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When $g^{T}Hg$ is zero or negative, the Taylor series approximation predicts that increasing $\epsilon$ forever will decrease f forever.

  • In practice, the Taylor series is unlikely to remain accurate for large $\epsilon$, so one must resort to more heuristic choices of $\epsilon$ in this case. When $g^{T}Hg$ is positive, solving for the optimal step size that decreases the Taylor series approximation the most yields

    $\epsilon^{*} = \frac{g^{T}g}{g^{T}Hg}$

    In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue $\lambda_{max}$ (if g aligns with the eigenvector, then $Hg = \lambda_{max}g$), this optimal step size is given by $1/\lambda_{max}$. The eigenvalues of the Hessian thus determine the scale of the learning rate.

  • The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point. In the univariate case this is trivial; we can draw a picture of $x^2$ or $-x^2$ to help us understand.

    In multiple dimensions, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum or a saddle point.

    When the Hessian matrix is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate case. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum.

    When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that x is a local maximum on one cross section of f but a local minimum on another cross section.

    (Figure: a saddle point, such as that of $f(x) = x_{1}^{2} - x_{2}^{2}$, which curves upward along $x_{1}$ and downward along $x_{2}$.)

    The test is inconclusive whenever all the nonzero eigenvalues have the same sign but at least one eigenvalue is zero.
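
To make the eigenvalue test concrete, here is a small numpy sketch (the example Hessian is that of the saddle $f(x) = x_1^2 - x_2^2$ at the origin, hand-derived for illustration):

```python
import numpy as np

# Hessian of f(x) = x1^2 - x2^2 at the critical point (0, 0)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals, eigvecs = np.linalg.eigh(H)  # H is real symmetric

if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("saddle point")
else:
    print("test inconclusive")

# directional second derivative along a unit vector d
d = np.array([1.0, 1.0]) / np.sqrt(2)
second_deriv = d @ H @ d  # weighted average of the eigenvalues
print(second_deriv)       # here: 0.5*2 + 0.5*(-2) = 0
```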

Condition number of the Hessian matrix

  • In multiple dimensions, there is a different second derivative for each direction at a single point.

  • The condition number of the Hessian at this point measures how much the second derivatives differ from each other.

  • When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative. So it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.

  • Poor condition number also makes choosing a good step size difficult. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature.

    (Figure: gradient descent on a quadratic with a poorly conditioned Hessian zigzags across the narrow valley while making little progress along it.)

    At the beginning the step size is still fairly large; later, because the curvature keeps increasing, the step size becomes smaller and smaller.

Newton’s method

  • We need a way to address this poor conditioning problem.

  • This issue can be resolved by using information from the Hessian matrix to guide the search. The simplest method is Newton’s method.

  • Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point $x^{(0)}$:

    $f(x) \approx f(x^{(0)}) + (x - x^{(0)})^{T}\nabla_{x}f(x^{(0)}) + \frac{1}{2}(x - x^{(0)})^{T}H(f)(x^{(0)})(x - x^{(0)})$

    If we then solve for the critical point of this function, we obtain

    $x^{*} = x^{(0)} - H(f)(x^{(0)})^{-1}\nabla_{x}f(x^{(0)})$
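
A minimal numpy sketch of one Newton step on a toy quadratic (the objective and its derivatives are illustrative assumptions; for a quadratic, a single step lands exactly on the minimum):

```python
import numpy as np

def grad_f(x):
    return np.array([x[0], 10.0 * x[1]])

def hessian_f(x):
    return np.array([[1.0, 0.0],
                     [0.0, 10.0]])

x0 = np.array([1.0, 1.0])
# Newton step: x* = x0 - H^{-1} g (solve a linear system instead of inverting)
x_star = x0 - np.linalg.solve(hessian_f(x0), grad_f(x0))
print(x_star)  # [0. 0.] for this quadratic
```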

Numerical precision: a deep learning super skill

  • Often deep learning algorithms “sort of work”

  • Often deep learning algorithms “explode”

  • Culprit is often loss of numerical precision

  • Rounding and truncation errors:

    • In a digital computer, we use float32 or similar schemes to represent real numbers.
    • A real number x is rounded to $x + \delta$ for some small $\delta$.
    • Overflow: large x replaced by inf
    • Underflow: small x replaced by 0 (very harmful)

  • Exp function:

    • Overflows for large x
    • Doesn’t need to be very large
    • In float32, exp(89) already overflows
    • Never use large x
    • Underflows for very negative x, possibly catastrophic if exp(x) in denominator.
  • Subtraction:

    Subtracting two nearby numbers destroys significant digits (catastrophic cancellation), e.g. computing $1 - x$ for $x \approx 1$.

  • Log and sqrt:

    log(0) gives $-\infty$ and the derivative of $\sqrt{x}$ blows up at $x = 0$; a common hack is to add a tiny $\epsilon$ to the argument.

  • Log exp:

    Computing log(exp(x)) naively overflows for large x even though the result is just x; simplify such expressions analytically.

  • Which is the better hack?

    The two hacks compared are presumably $\sqrt{x + \epsilon}$ and $\sqrt{x} + \epsilon$.

    The first one is better, because it avoids evaluating the derivative of $\sqrt{x}$ at $x = 0$, where it is infinite.

  • Log(sum(exp))

    Naively computing $\log\sum_{i}\exp(x_{i})$ overflows as soon as one $x_{i}$ is large, and underflows to $\log 0 = -\infty$ when all $x_{i}$ are very negative.

    The logsumexp trick: with $m = \max_{i} x_{i}$, compute $m + \log\sum_{i}\exp(x_{i} - m)$ instead (see the sketch at the end of this section).

    Why does the logsumexp trick work? It is algebraically equivalent to the original version:

    $\log\sum_{i}\exp(x_{i}) = \log\left(\exp(m)\sum_{i}\exp(x_{i} - m)\right) = m + \log\sum_{i}\exp(x_{i} - m)$

    (The original slide missed one log in the last equation.) Now the largest argument ever passed to exp is 0, so overflow is impossible, and at least one term of the sum equals 1, so the log never receives 0.

  • Softmax

    $\text{softmax}(x)_{i} = \frac{\exp(x_{i})}{\sum_{j}\exp(x_{j})}$. Stabilize it by shifting: $\text{softmax}(x) = \text{softmax}(x - \max_{j}x_{j})$.

  • Sigmoid

    $\sigma(x) = \frac{1}{1 + \exp(-x)}$. For very negative $x$, $\exp(-x)$ overflows; use the algebraically equivalent form $\frac{\exp(x)}{1 + \exp(x)}$ when $x < 0$.

  • Cross-entropy

    Naively computing $-\log\hat{y}$ blows up when the predicted probability $\hat{y}$ underflows to 0; compute the loss directly from the logits instead.

    The recommendation is to use the built-in library routines for all of these expressions.

  • Bug hunting strategies (slides omitted)
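
As referenced above, a minimal sketch of the stabilized versions of these expressions in plain numpy (in practice, prefer built-in routines such as scipy.special.logsumexp):

```python
import numpy as np

def logsumexp(x):
    # shift by the max so the largest exp argument is 0
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    # softmax(x) == softmax(x - max(x)), but the shifted form cannot overflow
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def sigmoid(x):
    # piecewise form avoids overflow in exp for very negative x
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

x = np.array([1000.0, 1001.0, 1002.0])  # naive exp(x) would overflow
print(logsumexp(x))                     # ~1002.41
print(softmax(x))                       # well-defined probabilities
print(sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0., 0.5, 1.]
```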

Machine learning basics

Different tasks of machine learning

  • Classification with missing inputs

    • Learn a set of functions, each function corresponds to classifying x with a different subset of its inputs missing.
    • A more efficient way is to learn a probability distribution over all the relevant variables, then solve the classification task by marginalizing out the missing variables.
  • Transcription

    • Optical character recognition
    • Google street view uses deep learning to process address numbers
    • Speech recognition
  • Machine translation
    • A sequence of symbols in some language converted to a sequence of symbols in another language
  • Structured output
    • The output values should be tightly interrelated, examples are as follows
    • Mapping a natural language sentence into a tree that describes its grammatical structure by tagging nodes of the tree as being verbs, nouns, adverbs and so on
    • Image captioning: the words produced by an image captioning model must form a valid sentence
  • Anomaly detection
    • Credit card fraud detection
  • Synthesis and sampling
    • The machine is asked to generate new examples that are similar to those in the training data
    • Video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel.
    • Given the input, generate the specific kind of output. For example, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence.
  • Denoising
    • Predict the clean example x from its corrupted version $\tilde{x}$.
  • Density estimation
    • Learn a function $p_{model}: \mathbb{R}^{n} \to \mathbb{R}$.

The performance measure P

  • Accuracy: the proportion of examples for which the model produces the correct output
  • Log-probability the model assigns to some examples
  • Sometimes hard to choose a performance measure that corresponds well to the desired behavior of the system.
    • In a regression task, should we penalize the system more if it frequently makes medium-sized mistakes, or if it rarely makes very large mistakes?

The experience E

  • Unsupervised learning
  • Supervised learning
  • Reinforcement learning algorithms interact with an environment, so they do not just experience a fixed dataset.

Generalization

  • It requires our algorithm to perform well on new, previously unseen inputs, not just those on which our model was trained.
  • The test error estimates the generalization error

Occam’s razor

  • This principle states that among competing hypotheses that explain known observations equally well, we should choose the “simplest” one.

Vapnik-Chervonenkis dimension

  • Measure the capacity of a binary classifier.
  • Defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.

Regularization

  • Expressing preferences for one function over another is a more general way of controlling a model’s capacity than including or excluding members from the hypothesis space.
  • We can think of excluding a function from a hypothesis space as expressing an infinitely strong preference against that function.

Validation set and hyperparameters

  • What are hyperparameters?
    • For example, in polynomial regression, the degree of the polynomial acts as a capacity hyperparameter.
  • Why use validation set?
    • Train hyperparameters
  • Why not train hyperparameters using training data?
    • Such hyperparameters would always choose the maximum model capacity, resulting in overfitting.

Statistics theory basic

  • See my other blog post: statistics theory

Maximum likelihood estimation

  • $\theta_{ML} = \arg\max_{\theta} p_{model}(\mathbb{X};\theta) = \arg\max_{\theta}\prod_{i=1}^{m}p_{model}(x^{(i)};\theta)$

  • Because rescaling the cost function does not change the argmax, we can take the logarithm and divide by m to obtain

    $\theta_{ML} = \arg\max_{\theta}\,\mathbb{E}_{x\sim\hat{p}_{data}}\log p_{model}(x;\theta)$

  • One way to interpret MLE is to view it as minimizing the dissimilarity between the empirical distribution $\hat{p}_{data}$ and the model distribution

  • The dissimilarity between the two can be measured by the KL divergence.

    $D_{KL}(\hat{p}_{data}\parallel p_{model}) = \mathbb{E}_{x\sim\hat{p}_{data}}\left[\log\hat{p}_{data}(x) - \log p_{model}(x)\right]$

  • Cross entropy:

    Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.

Stochastic gradient descent

  • Idea: gradient descent requires computing $\nabla_{\theta}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\nabla_{\theta}L(x^{(i)}, y^{(i)}, \theta)$; as the training set grows larger, the time to take a single gradient step becomes long. The insight of SGD is that the gradient is an expectation, so it can be estimated using a small set of samples.
  • On each step, sample a minibatch of size $m'$ uniformly from the training set. The estimate of the gradient is formed as $g = \frac{1}{m'}\nabla_{\theta}\sum_{i=1}^{m'}L(x^{(i)}, y^{(i)}, \theta)$, and the update is $\theta \leftarrow \theta - \epsilon g$.
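
A minimal sketch of the minibatch gradient estimate for linear least squares (the synthetic data, model, and squared loss are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))          # training inputs
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)
eps, batch = 0.1, 64
for step in range(500):
    idx = rng.integers(0, len(X), size=batch)  # sample a minibatch
    Xb, yb = X[idx], y[idx]
    # gradient of the mean squared loss on the minibatch
    g = 2.0 / batch * Xb.T @ (Xb @ w - yb)
    w -= eps * g
print(np.max(np.abs(w - true_w)))  # small after training
```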

Challenges motivating deep learning

  • The curse of dimensionality:
    • In one dimension, when we generalize to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same cell as the new input
    • When dimension is high, sometimes we cannot find the training data located in that space.
    • Suppose each variable takes the values 0 or 1. With 100 such variables, we would need at least $2^{100}$ training data points to have a whole picture of the space.

Local constancy and smoothness regularization

  • Smoothness prior:

    This prior states that the function we learn should not change very much within a small region

  • Many simpler algorithms rely exclusively on this prior to generalize well, and as a result, they fail to scale to the statistical challenges involved in solving AI-level tasks.

  • The k-nearest neighbors family of learning algorithms is an extreme example of the local constancy approach.

  • Decision trees also suffer from the limitations of exclusively smoothness-based learning, because:

    • They break the input space into as many regions as there are leaves and use separate parameters (one or more) in each region
    • If a target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree.
  • If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well?

    • Answer is yes
    • The key insight is that a very large number of regions, such as $O(2^k)$, can be defined with $O(k)$ examples, so long as we introduce some dependencies between the regions through additional assumptions about the underlying data-generating distribution.
    • In this way, we can generalize nonlocally.
    • Many different deep learning algorithms provide implicit or explicit assumptions of nonlocality.

Manifold learning

Deep Feedforward Networks

Computational graph and backpropagation

Learn conditional statistics

When we use a cross-entropy loss, we are learning the full distribution p(y|x); often we actually only need to learn one conditional statistic of y given x, say E(y|x).

  • Solving the optimization problem $f^{*} = \arg\min_{f}\,\mathbb{E}_{x,y\sim p_{data}}\left\|y - f(x)\right\|^{2}$ yields $f^{*}(x) = \mathbb{E}_{y\sim p_{data}(y|x)}[y]$:

    if we use the squared loss, the optimal f(x) is exactly E(y|x), and we can train the model toward it using gradient descent

  • In this way we only need the information in E(y|x), not the full p(y|x), so it seems we are saving some resources.

  • The truth is, cost functions derived this way are often hard to train with: some output units saturate and produce very small gradients when combined with these cost functions.

Sigmoid units for Bernoulli Output distribution:

  • Intuition for defining a probability over y using values of z:

    • we first obtain an unnormalized probability by assuming that $\log\tilde{P}(y|z) = yz$. The left-hand side can take any value in $(-\infty, \infty)$ since the unnormalized probability $\tilde{P}(y|z) > 0$, and $yz$ also ranges over $(-\infty, \infty)$, so this assumption is consistent.

    • Another reason for choosing this assumption is that it undoes the log in the cross-entropy loss.

    • So:

      $\tilde{P}(y) = \exp(yz),\qquad P(y) = \frac{\exp(yz)}{\sum_{y'=0}^{1}\exp(y'z)} = \sigma\left((2y - 1)z\right)$

      $J(\theta) = -\log P(y|x) = -\log\sigma\left((2y - 1)z\right) = \zeta\left((1 - 2y)z\right)$

    • The function $\zeta(x) = \log(1 + \exp(x))$ appearing in the last equation is called the softplus.

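
A small sketch of this loss computed directly from the logit z via a numerically stable softplus (the helper names are made up for the example; frameworks expose equivalents such as binary cross-entropy from logits):

```python
import numpy as np

def softplus(x):
    # zeta(x) = log(1 + exp(x)), computed stably: max(x, 0) + log1p(exp(-|x|))
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def bernoulli_nll_from_logit(z, y):
    # J = zeta((1 - 2y) z), the negative log-likelihood derived above
    return softplus((1.0 - 2.0 * y) * z)

print(bernoulli_nll_from_logit(np.array([10.0, -10.0]), np.array([1.0, 1.0])))
# small loss when the logit agrees with y=1, large loss when it does not
```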

Softmax units for multinoulli output distributions

$\text{softmax}(z)_{i} = \frac{\exp(z_{i})}{\sum_{j}\exp(z_{j})}$

$\log\text{softmax}(z)_{i} = z_{i} - \log\sum_{j}\exp(z_{j})$; the log undoes the exp, so a log-likelihood cost does not saturate, and the $z_{i}$ term contributes to the gradient directly.

Regularization for deep learning

L2 regularization

  • Aka weight decay

  • Consider a model with the following total objective function:

    $\tilde{J}(w; X, y) = \frac{\alpha}{2}w^{T}w + J(w; X, y)$

    With the corresponding parameter gradient

    $\nabla_{w}\tilde{J}(w; X, y) = \alpha w + \nabla_{w}J(w; X, y)$

    To take a single gradient step to update the weights, we perform this update:

    $w \leftarrow w - \epsilon\left(\alpha w + \nabla_{w}J(w; X, y)\right)$

    Written another way, the update is:

    $w \leftarrow (1 - \epsilon\alpha)w - \epsilon\nabla_{w}J(w; X, y)$

    We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step. We now examine what happens over the entire course of training.

  • To see what happens over the entire course of training, approximate J by its quadratic expansion around the optimal unregularized weights $w^{*} = \arg\min_{w}J(w)$: $\hat{J}(w) = J(w^{*}) + \frac{1}{2}(w - w^{*})^{T}H(w - w^{*})$, where H is the Hessian of J at $w^{*}$ (there is no first-order term because $w^{*}$ is a minimum).

  • The minimum of the regularized objective $\hat{J}(w) + \frac{\alpha}{2}w^{T}w$ occurs where $\alpha\tilde{w} + H(\tilde{w} - w^{*}) = 0$, giving $\tilde{w} = (H + \alpha I)^{-1}Hw^{*}$.

With the eigendecomposition $H = Q\Lambda Q^{T}$, this becomes $\tilde{w} = Q(\Lambda + \alpha I)^{-1}\Lambda Q^{T}w^{*}$: the component of $w^{*}$ along the i-th eigenvector of H is rescaled by $\frac{\lambda_{i}}{\lambda_{i} + \alpha}$.

Directions along which the objective is sensitive (large $\lambda_{i}$) are barely affected, while directions with $\lambda_{i} \ll \alpha$ are shrunk to nearly zero.
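
A tiny numpy sketch of this weight decay update and its closed-form limit (the diagonal toy Hessian is an assumption for illustration):

```python
import numpy as np

H = np.diag([10.0, 0.01])        # toy Hessian: one stiff, one flat direction
w_star = np.array([1.0, 1.0])    # unregularized optimum
alpha, eps = 0.1, 0.05

w = np.zeros(2)
for _ in range(5000):
    grad_J = H @ (w - w_star)                 # gradient of the quadratic objective
    w = (1 - eps * alpha) * w - eps * grad_J  # weight decay update
# compare with the closed form (H + alpha I)^{-1} H w*
print(w, np.linalg.solve(H + alpha * np.eye(2), H @ w_star))
```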

L1 regularization

http://www.deeplearningbook.org/contents/regularization.html

Dataset Augmentation

  • Image classification, for example rotate the image

  • Injecting noise into the neural network: dropout (injecting noise into the hidden units)

  • Noise applied to the weights:

    This is used primarily with recurrent networks. It can be interpreted as pushing the model into regions of parameter space where the output is relatively insensitive to small perturbations of the weights.

  • Injecting noise at the output targets:

    For example, label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with $\frac{\epsilon}{k-1}$ and $1 - \epsilon$, respectively.

Semi-supervised learning

Multitask learning

Early stopping

  • The phenomenon: the training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
  • Consider the number of training steps as a hyperparameter; the early stopping algorithm tunes this hyperparameter.
  • Run training and periodically evaluate on the validation set, keeping a copy of the parameters that achieved the best validation performance so far; if validation performance stops improving, halt and return the stored best parameters (see the sketch after this list).
  • Early stopping is a kind of regularization method: for a simple model with a quadratic cost trained by gradient descent, it is mathematically equivalent to L2 regularization.
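
A minimal sketch of this early stopping loop (the `train_steps` and `validation_loss` callables are hypothetical placeholders for real training and evaluation code):

```python
import copy

def early_stopping(model, train_steps, validation_loss, patience=10, eval_every=100):
    """Train until validation loss stops improving; return the best snapshot."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    fails = 0
    while fails < patience:
        train_steps(model, eval_every)         # run a chunk of optimizer steps
        loss = validation_loss(model)          # evaluate on the validation set
        if loss < best_loss:
            best_loss, fails = loss, 0
            best_model = copy.deepcopy(model)  # snapshot the best parameters
        else:
            fails += 1
    return best_model, best_loss
```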

Parameter sharing

  • From knowledge of the domain and model architecture, we may know that there should be some dependencies between the model parameters
  • A typical example is the CNN, which shares kernel weights across spatial locations

Dropout

  • Dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks
  • Intuition: cannot rely on any one feature, so have to spread out weights, resulting in shrinking weights like L2 norm.
  • Frequently used in computer vision where input is huge.
  • Dropout trains not just a bagged ensemble of model, but an ensemble of models that share hidden units. This means each hidden unit must be able to perform well regardless of which other hidden units are in the model. Hidden units must be prepared to be swapped and interchanged between models.
  • The noise is multiplicative. If the noise were additive with fixed scale, then a relu hidden unit $h_i$ with added noise $\epsilon$ could simply learn to have $h_i$ become very large in order to make the added noise $\epsilon$ insignificant by comparison.

Sparse Representations

  • L1 penalization induces a sparse parametrization, meaning that many of the parameters become zero.


  • Representational sparsity, on the other hand, describes a representation where many of the elements of the representations are zero.


  • Just as an L1 penalty on the parameters induces parameter sparsity, an L1 penalty on the elements of the representation induces representational sparsity.

Bagging and other ensemble methods

  • Bagging is a technique for reducing generalization error by combining several models. The idea is to train several different models separately, then have all the models vote on the output for the test examples.

  • Consider for example a set of k regression models. Suppose each model makes an error $\epsilon_{i}$ on each example, with variances $\mathbb{E}[\epsilon_{i}^{2}] = v$ and covariances $\mathbb{E}[\epsilon_{i}\epsilon_{j}] = c$. The expected squared error of the ensemble prediction (the average of the models) is

    $\mathbb{E}\left[\left(\frac{1}{k}\sum_{i}\epsilon_{i}\right)^{2}\right] = \frac{1}{k}v + \frac{k-1}{k}c$

    If the errors are perfectly correlated (c = v), the ensemble error stays at v and bagging does not help; if they are uncorrelated (c = 0), the error shrinks to v/k.
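
A quick Monte Carlo check of this formula (the variance v, covariance c, and ensemble size k are made-up simulation parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
k, v, c, n = 5, 1.0, 0.3, 200_000

# draw correlated errors: covariance matrix with v on the diagonal, c elsewhere
cov = np.full((k, k), c) + (v - c) * np.eye(k)
errors = rng.multivariate_normal(np.zeros(k), cov, size=n)

ensemble_err = errors.mean(axis=1)
print(np.mean(ensemble_err ** 2))   # empirical
print(v / k + (k - 1) / k * c)      # theoretical: 0.44
```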

Adversarial Training

https://www.youtube.com/watch?v=CIfsB_EYsVI&t=1348s

  • You want to craft some data on which the network performs very badly (fooling the network). This is actually an easy task.

  • This is because attacking a linear model is simple.

  • Modern deep nets are very piecewise linear.

  • The fast gradient sign method is very popular to generate the adversarial examples:

    • $x' = x + \epsilon\,\text{sign}\left(\nabla_{x}J(\theta, x, y)\right)$

      Here the sign of the gradient is used: if $\partial J(x)/\partial x_{i}$ is positive, the corresponding component after sign() is 1.

    • Adversarial examples are not noise

    • Cross-technique transferability allows you to use the adversarial examples developed on your own training algorithm to fool other networks


    • If you mount this adversarial attack on autonomous vehicles, you can fool them into treating pedestrians as navigable road, which is extremely dangerous!

    • Training on adversarial examples can help.

    • Another technique is called virtual adversarial training.

      With an unlabeled example x, the model first predicts a label $\hat{y}$; we then look for a small perturbation of x that would change this prediction, and train the model to keep its output on the perturbed input close to $\hat{y}$.

    • The adversarial training method can also be applied to build a universal engineering machine: we can make new inventions by finding inputs that maximize the model's predicted performance.

    • Conclusion:

      • Attacking is easy.
      • Defending is difficult.
    • Open-source library cleverhans: a library for benchmarking vulnerability to adversarial examples.
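
A minimal FGSM sketch, assuming a toy logistic-regression "network" so the input gradient has a closed form (all names and values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# toy logistic regression "network": p(y=1|x) = sigmoid(w.x + b)
w = np.array([2.0, -3.0, 1.0])
b = 0.5

def loss_grad_x(x, y):
    # gradient of the cross-entropy loss w.r.t. the INPUT x (not the weights)
    return (sigmoid(w @ x + b) - y) * w

x = np.array([0.3, -0.2, 0.8])
y = 1.0
eps = 0.1

x_adv = x + eps * np.sign(loss_grad_x(x, y))  # fast gradient sign method
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))  # confidence drops on x_adv
```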

Deep learning training tips

Testing the results on the training data set

deeper does not imply better

Vanishing gradient problem

  • near input:

    • smaller gradients
    • learn very slow
    • almost random
  • near output:

    • larger gradients
    • learn very fast
    • already converged
  • use relu, not sigmoid

    • relu: the network becomes a thinner, linear network, which does not suffer from shrinking gradients
    • sigmoid: vanishing gradient
    • some variations of relu: leaky relu, parametric relu, maxout(learnable activation function)
  • maxout:
    • activation function in maxout network can be any piecewise linear convex function
    • how many pieces depends on how many elements are in a group
    • how to train: for each example the network reduces to a thin, linear network, and we train that; different examples activate different thin linear networks, so all parameters get updated

Adaptive learning rate

  • adagrad: we give a smaller learning rate to parameters whose historical gradients are large. (Intuition: directions that were steep tend to stay steep, and flat directions tend to stay flat.)

  • rmsprop: divide the learning rate by a root mean square of the gradients that is updated as an exponential moving average:

    $\sigma^{t} = \sqrt{\alpha(\sigma^{t-1})^{2} + (1 - \alpha)(g^{t})^{2}},\qquad w^{t+1} = w^{t} - \frac{\eta}{\sigma^{t}}g^{t}$

    Newly seen gradients get a larger weight, and gradients from the past get a smaller weight.
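
A compact sketch of this rmsprop update (the learning rate, decay rate, and toy gradient are illustrative assumptions):

```python
import numpy as np

def rmsprop_step(w, g, sigma2, eta=0.01, alpha=0.9, eps=1e-8):
    """One rmsprop update; sigma2 is the running mean of squared gradients."""
    sigma2 = alpha * sigma2 + (1 - alpha) * g ** 2
    w = w - eta * g / (np.sqrt(sigma2) + eps)
    return w, sigma2

w, sigma2 = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = np.array([w[0], 10.0 * w[1]])  # gradient of a toy quadratic
    w, sigma2 = rmsprop_step(w, g, sigma2)
print(w)
```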

Hard to find optimal network parameters

  • very slow at the plateau
  • stuck at saddle point
  • stuck at local minima

Momentum

  • Intuition from the physical world: incorporate the momentum phenomenon into gradient descent

  • Momentum: the movement of the last step minus the gradient at present (inertia):

    $v^{t} = \lambda v^{t-1} - \eta\nabla L(\theta^{t-1}),\qquad \theta^{t} = \theta^{t-1} + v^{t}$

  • Movement = negative of gradient + Momentum

Adam

Combining rmsprop and momentum
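
A minimal sketch of how Adam combines the two: a momentum-style first moment and an rmsprop-style second moment, both bias-corrected (standard default hyperparameters; the toy gradient is an assumption):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # momentum: first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2     # rmsprop: second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    g = np.array([w[0], 10.0 * w[1]])  # gradient of a toy quadratic
    w, m, v = adam_step(w, g, m, v, t)
print(w)
```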

Regularization

  • new loss function to be minimized
  • L1 regularization: always subtracts a fixed amount from each weight ("always delete"), which can drive weights exactly to zero
  • L2 regularization: always scales the weights down by a constant factor ("always scale", i.e. weight decay)

Early stopping

Dropout

  • training: each time before updating the parameter:

    • each neuron has probability p% of being dropped out
  • Intuitive reason:

    • when working in a team, if everyone expects their partners to do the work, nothing gets done in the end.
    • However, if you know your partner might drop out, you will do better yourself.
    • When testing, no one actually drops out, so we eventually obtain good results.
  • Why should the weights be multiplied by (1 - p%) (where p% is the dropout rate) when testing?

    • Assume the dropout rate is 50%: during training each unit survives with probability 1/2, so the expected pre-activation of the next layer is half of what the full network would produce. At test time no unit is dropped, so halving the weights keeps the expected scale of the activations the same as during training.
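
A small numpy sketch of the two phases, in the formulation of these notes: drop units at training time, scale weights by (1 - p%) at test time (most libraries instead use inverted dropout, dividing by the keep probability during training; sizes and rate here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                       # dropout rate
W = rng.normal(size=(4, 8))   # weights of one layer
h = rng.normal(size=8)        # incoming activations

# training: each unit survives with probability (1 - p)
mask = rng.random(8) > p
train_out = W @ (h * mask)

# testing: no units are dropped, weights are scaled by (1 - p)
test_out = (1 - p) * W @ h    # matches E[train_out] over random masks
```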

Sequence Modeling : Recurrent and Recursive Nets

Unfolding computational graphs

Recurrent Neural Network

  • Some examples of important design patterns for RNN:

    • Recurrent networks that produce an output at each time step and have recurrent connections between hidden units.

      (Figure 10.3 in the DL book.)

    • Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step.

      (Figure 10.4 in the DL book.)

    • Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output.

      (Figure 10.5 in the DL book.)

  • Forward propagation equations for the RNN in figure 10.3

    • We assume the hyperbolic tangent activation function

    • Apply the softmax operation to obtain $\hat{y}$, which gives normalized probabilities over the outputs.

    • Begin with a specification of the initial state $h^{0}$. Then from t = 1 to t = $\tau$, we apply the following update equations:

      $a^{(t)} = b + Wh^{(t-1)} + Ux^{(t)}$
      $h^{(t)} = \tanh\left(a^{(t)}\right)$
      $o^{(t)} = c + Vh^{(t)}$
      $\hat{y}^{(t)} = \text{softmax}\left(o^{(t)}\right)$

    • This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values is just the sum of the losses over all the time steps:

      $L = \sum_{t}L^{(t)} = -\sum_{t}\log p_{model}\left(y^{(t)}\mid x^{(1)},\ldots,x^{(t)}\right)$

    • Procedure:

      • We start the recursion with the node immediately preceding the final loss: $\frac{\partial L}{\partial L^{(t)}} = 1$

      • Then the gradient on the outputs (for softmax output units with negative log-likelihood loss) is $\left(\nabla_{o^{(t)}}L\right)_{i} = \hat{y}_{i}^{(t)} - \mathbf{1}_{i=y^{(t)}}$

      • At the final time step, $h^{(\tau)}$ has only $o^{(\tau)}$ as a descendant, so $\nabla_{h^{(\tau)}}L = V^{T}\nabla_{o^{(\tau)}}L$.

      • We then iterate backward from $t = \tau - 1$ down to $t = 1$; each $h^{(t)}$ has both $o^{(t)}$ and $h^{(t+1)}$ as descendants:

        $\nabla_{h^{(t)}}L = W^{T}\,\text{diag}\left(1 - \left(h^{(t+1)}\right)^{2}\right)\nabla_{h^{(t+1)}}L + V^{T}\nabla_{o^{(t)}}L$

        where $\text{diag}\left(1 - (h^{(t+1)})^{2}\right)$ is the Jacobian of the tanh nonlinearity at time t + 1.

      • Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes.

      • However, the $\nabla_{W}$ operator used in calculus takes into account the contribution of W to the value of f due to all edges in the computational graph. To resolve this ambiguity, we introduce dummy variables $W^{(t)}$ that are defined to be copies of W, with each $W^{(t)}$ used only at time step t. For example:

        $\nabla_{V}L = \sum_{t}\left(\nabla_{o^{(t)}}L\right)h^{(t)T},\qquad \nabla_{W}L = \sum_{t}\text{diag}\left(1 - \left(h^{(t)}\right)^{2}\right)\left(\nabla_{h^{(t)}}L\right)h^{(t-1)T}$
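
A minimal numpy sketch of the forward propagation equations above for the figure-10.3 RNN (dimensions and random parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 4, 6

U = rng.normal(scale=0.1, size=(n_hid, n_in))   # input-to-hidden
W = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden-to-hidden
V = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden-to-output
b, c = np.zeros(n_hid), np.zeros(n_out)

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

x = rng.normal(size=(T, n_in))  # input sequence
h = np.zeros(n_hid)             # initial state h^(0)
for t in range(T):
    a = b + W @ h + U @ x[t]    # a^(t) = b + W h^(t-1) + U x^(t)
    h = np.tanh(a)              # h^(t)
    o = c + V @ h               # o^(t)
    y_hat = softmax(o)          # normalized probabilities over outputs
    print(t, y_hat)
```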

  • Teacher Forcing and Networks with Output Recurrence: basically used for figure 10.4

    • The network with recurrent connections only from the output (figure 10.4) is less powerful

      • Because it lacks hidden-to-hidden recurrent connections.
      • it requires that the output units capture all of the information about the past that the network will use to predict the future.
      • Because the output units are trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input.
    • The advantage of eliminating hidden-to-hidden recurrence is that

      • For any loss function based on comparing the prediction at time t to the training target at time t, all the time steps are decoupled.
      • Training can then be parallelized, with the gradient for each step t computed in isolation.
      • There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.
    • Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.

      • When training, using training set $y^{(t)}$ instead of $o^{(t)}$.

      • When testing, using model’s output $o^{(t)}$

        (Figure 10.6 in the DL book illustrates teacher forcing.)

    • Teacher forcing is originally motivated by avoiding back-propagation through time.

    • The disadvantage of teacher forcing arises when the model is deployed: the outputs fed back into the model make the network see quite different inputs from those seen at training time. One way to mitigate the problem is to train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths.
  • Recurrent Networks as directed graphical models

    • Let us consider a sequence of data points $y^{(1)},\ldots,y^{(\tau)}$. The joint distribution of these observations is $P(\mathbb{Y}) = \prod_{t=1}^{\tau}P\left(y^{(t)}\mid y^{(t-1)},\ldots,y^{(1)}\right)$.
    • The negative log-likelihood of a set of values according to such a model is $L = \sum_{t}L^{(t)}$, where $L^{(t)} = -\log P\left(y^{(t)}\mid y^{(t-1)},\ldots,y^{(1)}\right)$.
    • It is common to make the Markov assumption that the graphical model should only contain edges from $\left\{y^{(t-k)},\ldots,y^{(t-1)}\right\}$ to $y^{(t)}$; however, in some cases, we believe that the distribution over $y^{(t)}$ depends on a value of $y^{(i)}$ from the distant past.
    • Let us consider the case where the RNN models only a sequence of scalar random variables Y , with no additional inputs x. The RNN then defines a directed graphical model over the y variables.
    • To model such relationship, we do feed the actual y values back into the network, so the directed graphical model contains edges from all $y^{(i)}$ values in the past to the current $y^{(t)}$ value.


    • One way to interpret an RNN as a graphical model is to view the RNN as a complete graph, able to represent direct dependencies between any pair of y.

      (Figure 10.7 in the DL book: the fully connected graphical model over the y values.)

    • If we consider the graphical model structure of an RNN that results from regarding the hidden units $h^{(t)}$ as random variables, the hidden units mediate the dependencies between observations, which makes the parametrization very efficient (figure 10.8 in the DL book).

    • The price RNNs pay for their reduced number of parameters is that optimizing the parameters may be difficult.

    • To complete the view of an RNN as a graphical model, we must describe how to draw samples from the model

      • We need to sample from the conditional distribution at each time step
      • The RNN must have some mechanism for determining the length of the sequence; some options are listed below:
        • By adding a special symbol corresponding to the end of a sequence. When the symbol is generated, the sampling process stops
        • Introduce an extra Bernoulli output to the model that represents the decision to either continue generation or halt generation at each time step
        • Add an extra output to the model that predicts the integer $\tau$ itself.
  • Modeling Sequences Conditioned on Context with RNNs

    • Previously, we have discussed RNNs that take a sequence of vectors $x^{(t)}$ for $t = 1,\ldots,\tau$ as input. Another option is to take only a single vector x as input. A first and most common approach is illustrated in figure 10.9

    (Figure 10.9 in the DL book.)

    • Rather than receiving only a single vector x as input, the RNN may receive a sequence of vectors $x^{(t)}$ as input. The RNN described in equation 10.8 corresponds to a conditional distribution $P\left(y^{(1)},\ldots,y^{(\tau)}\mid x^{(1)},\ldots,x^{(\tau)}\right)$ that makes a conditional independence assumption that this distribution factorizes as

$\prod_{t}P\left(y^{(t)}\mid x^{(1)},\ldots,x^{(t)}\right)$

To remove the conditional independence assumption, we can add connections from the output at time t to the hidden unit at time t + 1, as shown in figure 10.10. The model can then represent arbitrary probability distributions over the y sequence.

(Figure 10.10 in the DL book.)

  • The model described in Figure 10.10 still has one restriction, which is that the length of both sequences must be the same.

Bidirectional RNNs

  • Bidirectional RNNs combine an RNN that moves forward through time beginning from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence. See Figure 10.11

    (Figure 10.11 in the DL book.)

  • The idea can be naturally extended to 2-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right.
  • Application: speech recognition, handwriting recognition, bioinformatics

Encoder-decoder sequence-to-sequence architectures

  • How RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length.

    (Figure 10.12 in the DL book.)

  • The idea is very simple

    • An encoder or reader or input RNN processes the input sequence. The encoder emits the context C
    • A decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence Y
    • The two RNNs are trained jointly to maximize the average of $\log P\left(y^{(1)},\ldots,y^{(n_{y})}\mid x^{(1)},\ldots,x^{(n_{x})}\right)$ over all the pairs of x and y sequences in the training set
  • One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence.

  • To mitigate the above problem, an attention mechanism can learn to associate elements of the sequence C with elements of the output sequence

Deep Recurrent Networks

  • Experimental evidence strongly suggests introducing depth in each of these operations (the input-to-hidden, hidden-to-hidden, and hidden-to-output transformations)

  • (Figure 10.13 in the DL book.)

Recursive Neural Networks

  • (Figure 10.14 in the DL book: a recursive network operates on a tree-structured computational graph.)

The Challenge of Long-Term Dependencies

  • The vanishing and exploding gradient problem: consider repeatedly applying the same weight matrix, $h^{(t)} = W^{t}h^{(0)}$. If $W = Q\Lambda Q^{T}$ (an eigendecomposition with orthogonal Q), then $h^{(t)} = Q\Lambda^{t}Q^{T}h^{(0)}$: components along eigenvectors with eigenvalue magnitude less than one decay to zero, and components with magnitude greater than one explode, so any component of $h^{(0)}$ that is not aligned with the largest eigenvector will eventually be discarded.
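
A tiny numpy demonstration of this effect (the matrix and the initial state are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
W = A + A.T                     # symmetric, so W = Q Lambda Q^T with real eigenvalues
W /= np.max(np.abs(np.linalg.eigvalsh(W)))  # scale so the top |eigenvalue| is 1

h = rng.normal(size=4)
for t in range(200):
    h = W @ h                   # h^(t) = W^t h^(0)

# after many steps, h is (anti-)aligned with the dominant eigenvector
eigvals, eigvecs = np.linalg.eigh(W)
top = eigvecs[:, np.argmax(np.abs(eigvals))]
print(np.abs(h / np.linalg.norm(h)) - np.abs(top))  # ~0 componentwise
```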

Echo State Networks

Leaky Units and Other Strategies for Multiple Time scales

The LSTM and other gated RNNs

  • LSTM: the core idea is a self-looping cell state $s^{(t)}$ whose flow is controlled by gates. In the standard formulation (figure 10.16 in the DL book shows the corresponding block diagram):

    $f^{(t)} = \sigma\left(b^{f} + U^{f}x^{(t)} + W^{f}h^{(t-1)}\right)$ (forget gate)
    $g^{(t)} = \sigma\left(b^{g} + U^{g}x^{(t)} + W^{g}h^{(t-1)}\right)$ (input gate)
    $s^{(t)} = f^{(t)}\odot s^{(t-1)} + g^{(t)}\odot\tanh\left(b + Ux^{(t)} + Wh^{(t-1)}\right)$ (cell state)
    $q^{(t)} = \sigma\left(b^{o} + U^{o}x^{(t)} + W^{o}h^{(t-1)}\right)$ (output gate)
    $h^{(t)} = q^{(t)}\odot\tanh\left(s^{(t)}\right)$ (output)
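
A single LSTM step in numpy following the equations above (dimensions and random parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# one parameter block per gate: [forget, input, cell, output]
U = rng.normal(scale=0.1, size=(4, n_hid, n_in))
W = rng.normal(scale=0.1, size=(4, n_hid, n_hid))
b = np.zeros((4, n_hid))

def lstm_step(x, h, s):
    f = sigmoid(b[0] + U[0] @ x + W[0] @ h)              # forget gate
    g = sigmoid(b[1] + U[1] @ x + W[1] @ h)              # input gate
    s = f * s + g * np.tanh(b[2] + U[2] @ x + W[2] @ h)  # cell state
    q = sigmoid(b[3] + U[3] @ x + W[3] @ h)              # output gate
    h = q * np.tanh(s)
    return h, s

h, s = np.zeros(n_hid), np.zeros(n_hid)
for t in range(4):
    h, s = lstm_step(rng.normal(size=n_in), h, s)
print(h)
```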

Optimization for Long-Term Dependencies

  • Clipping Gradients

  • One option is to clip the parameter gradient from a minibatch element-wise just before the parameter update. Another is to clip the norm $\|g\|$ of the gradient g just before the parameter update (see the sketch below):

  • if $\|g\| > v$, set $g \leftarrow \frac{gv}{\|g\|}$, where v is the norm threshold

  • (Figure: without clipping, gradient descent overshoots when it hits a steep cliff in the loss landscape; with clipping, the step stays bounded.)
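
A two-line sketch of clipping by norm (the threshold v is a hyperparameter):

```python
import numpy as np

def clip_by_norm(g, v):
    """Rescale gradient g so its norm never exceeds the threshold v."""
    norm = np.linalg.norm(g)
    return g * (v / norm) if norm > v else g

g = np.array([30.0, 40.0])      # norm 50
print(clip_by_norm(g, v=5.0))   # rescaled to norm 5: [3., 4.]
```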

Explicit Memory

  • Neural networks excel at storing implicit knowledge, but they struggle to memorize facts.
    • Implicit knowledge: how to walk, or how a dog looks different from a cat
    • Explicit knowledge: I have a meeting with Susan at 11:00 am on Sunday

Applications

  • Handwriting recognition using bidirectional LSTMs; the data is collected from an interactive whiteboard, where sensors record the (x, y) coordinates of the pen at regularly sampled time steps.

RNN in spatial statistics